Proactive Fault-Recovery in Distributed Systems

Author

  • Soila M. Pertet
Abstract

Supporting both real-time and fault-tolerance properties in a system is challenging because real-time systems require predictable end-to-end schedules and bounded temporal behavior in order to meet task deadlines. However, system failures, which are typically unanticipated events, can disrupt the predefined real-time schedule and result in missed task deadlines. Such disruptions to the real-time schedule are aggravated in asynchronous distributed systems by two main factors: first, delays in failure detection, and second, increased latencies due to the reactive fault-recovery routines that are set into motion once a failure is detected. In this thesis, we present a general framework for proactive (rather than the classical reactive) fault-recovery that reduces the latencies incurred by the fault-recovery routines. Proactive fault-recovery is a technique that exploits fault-prediction mechanisms in order to compensate for failures even before they occur, thereby providing bounded temporal behavior in real-time and fault-tolerant systems for certain classes of faults. In our framework, we also show how to exploit knowledge of the underlying system topology to apply the benefits of proactive fault-recovery to multi-tiered distributed systems. We evaluate the impact of the design choices we faced when implementing a prototype of this framework in a distributed CORBA application. Our preliminary results show a promising 76% reduction in the worst-case fault-recovery latencies in our application. This demonstrates that proactive fault-recovery can indeed provide bounded temporal behavior in the presence of certain kinds of faults, thereby facilitating the development of real-time, fault-tolerant distributed systems.
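
The idea of compensating for a failure before it occurs can be illustrated with a minimal sketch. It is not the framework from the thesis: the threshold-based predictor, the `usage` metric, the 0.8 trigger value, and the names `Replica` and `ProactiveManager` are all assumptions made for illustration. The sketch only shows the general shape of the technique, in which requests are moved to a healthy replica when a warning sign appears, rather than after a crash has been detected.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    usage: float = 0.0  # hypothetical health metric: fraction of a resource consumed


@dataclass
class ProactiveManager:
    replicas: list
    threshold: float = 0.8  # assumed trigger point, not a value from the thesis
    active: int = 0         # index of the replica currently serving requests
    log: list = field(default_factory=list)

    def observe(self, name, usage):
        """Record a usage sample; migrate if the active replica looks unhealthy."""
        for replica in self.replicas:
            if replica.name == name:
                replica.usage = usage
        if self.replicas[self.active].usage >= self.threshold:
            self._migrate()

    def _migrate(self):
        # Consider only other replicas that are still below the threshold.
        candidates = [
            i for i, replica in enumerate(self.replicas)
            if i != self.active and replica.usage < self.threshold
        ]
        if not candidates:
            return  # nowhere healthy to go; stay put and rely on reactive recovery
        target = min(candidates, key=lambda i: self.replicas[i].usage)
        self.log.append((self.replicas[self.active].name, self.replicas[target].name))
        self.active = target

    def route(self):
        """Return the name of the replica that should receive the next request."""
        return self.replicas[self.active].name


mgr = ProactiveManager([Replica("primary"), Replica("backup")])
mgr.observe("primary", 0.5)   # below the threshold: nothing happens
mgr.observe("primary", 0.85)  # warning sign: requests move to the backup
print(mgr.route())
```

In this sketch the hand-off happens while the primary is still running, so the cost of failure detection and reactive recovery is taken off the request path for faults that announce themselves in advance. Faults that give no warning would still need a reactive mechanism.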


Similar resources

Influence of Fault Current Limiter in Voltage Drop and TRV Considering Wind Farm

The presence of distributed generation systems in distribution systems can increase the level of short-circuit current. The effectiveness of distributed generation systems is affected by the size, location, and type of distributed generation technology, and by the methods of connecting to distribution systems. A wind turbine system is one example of a distributed generation source. Not only does...


Proactive Service Migration for Long-Running Byzantine Fault Tolerant Systems

In this paper, we describe a novel proactive recovery scheme based on service migration for long-running Byzantine fault tolerant systems. Proactive recovery is an essential method for ensuring long term reliability of fault tolerant systems that are under continuous threats from malicious adversaries. The primary benefit of our proactive recovery scheme is a reduced vulnerability window. This ...


Middleware for Embedded Adaptive Dependability

The Middleware for Embedded Adaptive Dependability (MEAD) infrastructure enhances large-scale distributed real-time embedded middleware applications with novel capabilities, including (i) transparent, yet tunable, fault tolerance in real time, (ii) proactive dependability, (iii) resource-aware system adaptation to crash, communication, partitioning and timing faults with (iv) scalable and fast ...


Reliable Broadcast in a Computational Hybrid Model with Byzantine Faults, Crashes, and Recoveries

This paper presents a formal model for asynchronous distributed systems with parties that exhibit Byzantine faults or that crash and subsequently recover. Motivated by practical considerations, it represents an intermediate step between crash-recovery models for distributed computing and proactive security methods for tolerating arbitrary faults. The model is computational and based on complexi...


Handling Cascading Failures: The Case for Topology-Aware Fault-Tolerance

Large distributed systems contain multiple components that can interact in sometimes unforeseen and complicated ways; this emergent “vulnerability of complexity” increases the likelihood of cascading failures that might result in widespread disruption. Our research explores whether we can exploit the knowledge of the system’s topology, the application’s interconnections and the application’s no...




Publication date: 2004